Having worked previously on wines, I chose the red and white wine data sets. After removing the variable “X” from each, I added a variable “type” (1 for red wines, 0 for white wines) and combined the two using the rbind() command. The new data set, wines, has 13 features, including type and quality, both of which were converted from integer to factor variables. The goal of this exploratory analysis is to determine which features best separate the different quality values of wines.
## [1] 6497 15
There are 13 actual variables. The other two columns (quality.f and quality.o) were created from the variable quality to represent it as a factor and as an ordered factor. Two of the 13 variables (type and quality) are output variables.
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [13] "type" "quality.f" "quality.o"
## 'data.frame': 6497 obs. of 15 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ type : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ quality.f : Factor w/ 7 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## $ quality.o : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## [1] "3" "4" "5" "6" "7" "8" "9"
## [1] "0" "1"
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.000
## Mean : 7.215 Mean :0.3397 Mean :0.3186 Mean : 5.443
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 1.00 Min. : 6.0
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0
## Median :0.04700 Median : 29.00 Median :118.0
## Mean :0.05603 Mean : 30.53 Mean :115.7
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0
## Max. :0.61100 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50
## Median :0.9949 Median :3.210 Median :0.5100 Median :10.30
## Mean :0.9947 Mean :3.219 Mean :0.5313 Mean :10.49
## 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30
## Max. :1.0390 Max. :4.010 Max. :2.0000 Max. :14.90
##
## quality type quality.f quality.o
## Min. :3.000 0:4898 3: 30 3: 30
## 1st Qu.:5.000 1:1599 4: 216 4: 216
## Median :6.000 5:2138 5:2138
## Mean :5.818 6:2836 6:2836
## 3rd Qu.:6.000 7:1079 7:1079
## Max. :9.000 8: 193 8: 193
## 9: 5 9: 5
Most of the data come from white wines (4898 white wine and 1599 red wine observations). The features are physicochemical tests of wines: fixed acidity (g tartaric acid/L), volatile acidity (g acetic acid/L), citric acid (g/L), residual sugar (g/L), chlorides (g sodium chloride/L), free sulfur dioxide (mg/L), total sulfur dioxide (mg/L), density (g/mL), pH, sulphates (potassium sulfate, g/L) and alcohol (% by volume). Quality is an output variable: a score given by a human test panel, with possible values from 0 to 10, 10 being the best.
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Ignoring unknown parameters: binwidth, bins, pad
White wines and red wines have different distributions in some of the variables. Because of this, it was necessary to analyze the reds separately from the whites.
Some variables are normally distributed and some are not. It is surprising that, in some cases, a variable is normally distributed in red wines but not in white wines (for example, chlorides). The variable whose distribution differs most between whites and reds is citric acid.
For some variables, transformation created a more normal distribution, but for others it didn’t change the distribution.
## Warning: Removed 23 rows containing non-finite values (stat_bin).
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing non-finite values (stat_bin).
Residual sugar is not normally distributed. Transformation using log10 yields something like a bimodal distribution for white wines.
## Warning: Removed 1984 rows containing non-finite values (stat_bin).
Transformation of chlorides variable for the white wines made it a little more normally distributed.
## Warning: Removed 276 rows containing non-finite values (stat_bin).
Free sulfur dioxide for red wine does not look like a normal curve, and transformation didn’t result in a normal distribution.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
Transformation for the total sulfur dioxide distribution was needed for red wines though it wasn’t needed for the white wines.
## Warning: Removed 6 rows containing non-finite values (stat_bin).
Calculating the ratio of free to total sulfur dioxide created a feature that is normally distributed.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Transformation of density didn’t change the shape of the distribution for any of the wines.
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Warning: Removed 3 rows containing non-finite values (stat_bin).
## Warning: Removed 3 rows containing non-finite values (stat_bin).
The pH of all wines has an approximately normal distribution.
Transformation of sulphates produced a slightly more normal distribution.
## Warning: Removed 10 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Alcohol distribution isn’t normal and transformation didn’t do much to improve the plot though red wines’ distribution seemed to be bimodal.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
These plots show the spread of the data in another way, and show that all features have outliers (not all variables are shown).
The resulting “wines” data set has 6497 observations and 13 variables (11 input variables plus quality and type). It was created from the red wine and white wine data sets downloaded from the Udacity project site.
The main features are quality, type, and probably alcohol. Quality was based on a human test panel. The histograms show that most wines were classified as 5, 6 or 7. Only a few were rated 8 or 9 (only five 9’s, none of them reds). The 30 worst wines were rated 3. None of the wines were rated below 3 or as high as 10. The boxplots show that all features have outliers. With some of the outliers removed from the plots, the distributions of the features are easier to see, and red wines usually have a different distribution from white wines.
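The outlier claim can be checked numerically as well as visually. The following is a minimal sketch, assuming the usual boxplot convention (points more than 1.5 times the hinge spread beyond the hinges); the helper name `count_outliers` and the toy values are illustrative and are not taken from the analysis.

```r
# Count boxplot outliers per numeric column, using the same rule that
# boxplot() uses by default (1.5 x the hinge spread beyond the hinges).
count_outliers <- function(df) {
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  vapply(numeric_cols, function(x) length(boxplot.stats(x)$out), integer(1))
}

# Toy data frame with made-up values (NOT the wine data): each column
# contains one value far above the rest.
toy <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 10.0, 9.5, 10.5, 14.9),
  chlorides = c(0.076, 0.098, 0.092, 0.075, 0.069, 0.065, 0.611)
)

outlier_counts <- count_outliers(toy)
print(outlier_counts)
```

Applied to the full data frame, a non-zero count for every feature would correspond to the statement that all features have outliers.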
It makes sense to think that the levels of all components of a wine can determine its quality. Therefore, aside from alcohol, I think that citric acid, fixed acidity (tartaric acid content), sulphates and sulfur dioxide levels will contribute to the quality of wines. Some of the features in the data set are related to each other, as will be seen in the bivariate section, so it might be a good idea to pick only one variable from each group of related variables.
To create some of the plots above, I made an ordered factor variable out of “quality”. I also created the ratio of free sulfur dioxide to total sulfur dioxide; its distribution differs from those of the individual features, as shown in the histogram for “free to total SO2 ratio” above. One can also calculate a ratio of citric acid to fixed acidity, but when I tried this, it gave no new information.
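As a small illustration of the ratio feature, the sketch below recomputes it for the first ten rows of the two sulfur dioxide columns shown in the str() output earlier. The column name `free.to.total.so2` is an assumption; the text does not give the name actually used.

```r
# First ten rows of the two sulfur dioxide columns, copied from the str()
# output shown earlier in this report.
wines_head <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Hypothetical column name for the derived feature.
wines_head$free.to.total.so2 <- with(
  wines_head, free.sulfur.dioxide / total.sulfur.dioxide
)

# Free SO2 is a part of total SO2, so every ratio should lie in (0, 1].
ratio_in_range <- all(
  wines_head$free.to.total.so2 > 0 & wines_head$free.to.total.so2 <= 1
)
print(round(wines_head$free.to.total.so2, 3))
```

Because the ratio is bounded between 0 and 1, it tends to be less affected by the long right tails seen in the two raw sulfur dioxide variables.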
Many of the features were not normally distributed; transforming them produced distributions that approach the normal curve, but not completely. Some didn’t change at all.
Fixed acidity has a long tail, so a transformation was applied. The resulting histogram is more normally distributed.
Volatile acidity is right-skewed (most values sit at the low end), and the log10 transformation revealed the bimodal character of the distribution.
Residual sugar is not normally distributed. Transformation using log10 yielded something like a bimodal distribution.
Chlorides also don’t look normally distributed. Transformation made it look better, but it also revealed a bimodal distribution.
Free sulfur dioxide and total sulfur dioxide had non-normal distributions, and transformation didn’t help. However, the ratio of the two has a normal distribution.
Transformation of density also didn’t make the distribution better.
Alcohol distribution isn’t normal and transformation didn’t change the distribution.
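The effect of a log10 transformation on a skewed feature can also be quantified rather than judged from histograms alone. The sketch below is an illustration on simulated data, not on the wine data; the `skewness` helper is written out by hand, and the log-normal sample is only an assumed stand-in for a right-skewed feature such as residual sugar.

```r
# Sample skewness (third standardized moment), written out by hand so that
# no extra package is needed. Non-finite values are dropped first, because
# log10(0) is -Inf for features that contain zeros.
skewness <- function(x) {
  x <- x[is.finite(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)
# Simulated right-skewed, strictly positive values (NOT the wine data).
x <- rlnorm(1000, meanlog = 1, sdlog = 0.8)

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))
cat("skewness before:", round(skew_raw, 2),
    " after log10:", round(skew_log, 2), "\n")
```

A skewness near zero after the transformation indicates a roughly symmetric distribution. It does not detect bimodality, so a histogram is still needed for features such as residual sugar.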
Regarding tidying, the data have no missing values, so no cleaning of missing data was needed. All I did was combine the two csv files, create the factor and ordered factor variables from quality, and add the variable “type”.
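The tidying steps just described can be sketched as follows. The two small data frames are toy stand-ins with made-up values; in the actual analysis they would come from read.csv() on the two downloaded files, whose names are not given in the text.

```r
# Toy stand-ins for the two csv files (made-up values, NOT the wine data).
reds   <- data.frame(X = 1:3, alcohol = c(9.4, 9.8, 10.0), quality = c(5L, 5L, 7L))
whites <- data.frame(X = 1:2, alcohol = c(8.8, 9.5), quality = c(6L, 6L))

# Drop the row-index column "X" and tag each set: 1 = red, 0 = white.
reds$X <- NULL
whites$X <- NULL
reds$type <- 1
whites$type <- 0

# Stack the two sets; rbind() requires identical column names.
wines <- rbind(reds, whites)
wines$type <- factor(wines$type)

# Plain and ordered factor versions of quality, named as in the str() output.
wines$quality.f <- factor(wines$quality)
wines$quality.o <- factor(wines$quality, ordered = TRUE)

str(wines)
```

Keeping the original integer quality column alongside the factor versions makes it possible to compute numeric summaries and rank correlations while still using the factors for grouping and coloring in plots.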
Correlation matrix for all wines (Spearman)
Correlation matrix for red wines (Spearman)
Correlation matrix for white wines (Spearman)
For red wines, quality is correlated to volatile acidity, sulphates and alcohol. For white wines, quality is correlated to chlorides, density and alcohol.
There are also correlations existing between some of the input variables.
For example, density and alcohol have a negative correlation. This makes sense because alcohol is less dense than water, so a mixture with more alcohol is expected to have a lower density than one with less alcohol. Density is also related to residual sugar: the more sugar there is, the denser the mixture.
pH would expectedly be correlated with fixed acidity, volatile acidity and citric acid. So are total and free sulfur dioxide.
So choosing the best features that help the most in classifying wines would be a good idea.
But it is surprising to me that the correlations are sometimes different for the two types of wine.
The following plots explore the correlation of quality and some of the input variables.
## Warning: Removed 23 rows containing non-finite values (stat_summary).
## Warning: Removed 25 rows containing missing values (geom_point).
## Warning: Removed 165 rows containing non-finite values (stat_summary).
## Warning: Removed 165 rows containing missing values (geom_point).
## Warning: Removed 299 rows containing non-finite values (stat_summary).
## Warning: Removed 313 rows containing missing values (geom_point).
## Warning: Removed 22 rows containing non-finite values (stat_summary).
## Warning: Removed 22 rows containing missing values (geom_point).
Correlation among some input variables:
grid.arrange(c1, c2, ncol=2)
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).
## Warning: Removed 6 rows containing non-finite values (stat_smooth).
## Warning: Removed 6 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
Initially, I found it hard to identify correlations when all types of wines were analyzed together. Learning how to create the correlation matrices above made it easier to pinpoint which input variables can contribute to the classification of wine quality: alcohol, volatile acidity, chlorides and density.
Because the two types of wine behave differently for several features, I then analyzed them separately. Using a correlation matrix again, I found which input variables correlate with quality in each type. As mentioned above, quality correlates with volatile acidity, sulphates and alcohol for red wines, and with chlorides, density and alcohol for white wines. These have the highest correlation coefficients (Spearman method).
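The per-type ranking of inputs by their Spearman correlation with quality can be sketched as below. This is an illustration on simulated data, not a reproduction of the analysis: the helper name `rank_by_quality_cor` is hypothetical, and the simulated quality score is constructed to depend on alcohol in both types and on volatile acidity in type 1 (red) only.

```r
# Rank input variables by the absolute Spearman correlation with quality.
rank_by_quality_cor <- function(df, inputs) {
  rho <- vapply(
    inputs,
    function(v) cor(df[[v]], df$quality, method = "spearman"),
    numeric(1)
  )
  rho[order(abs(rho), decreasing = TRUE)]
}

set.seed(1)
n <- 200
# Simulated data (NOT the wine data); type 0 = white, type 1 = red.
sim <- data.frame(
  type             = factor(rep(c(0, 1), each = n / 2)),
  alcohol          = rnorm(n, 10.5, 1.2),
  volatile.acidity = rnorm(n, 0.34, 0.16),
  chlorides        = rnorm(n, 0.056, 0.01)
)
sim$quality <- round(
  6 + 0.6 * (sim$alcohol - 10.5) -
    ifelse(sim$type == "1", 4 * (sim$volatile.acidity - 0.34), 0) +
    rnorm(n, 0, 0.5)
)

inputs <- c("alcohol", "volatile.acidity", "chlorides")
# One ranked vector of correlations per wine type.
by_type <- lapply(split(sim, sim$type), rank_by_quality_cor, inputs = inputs)
print(by_type)
```

Spearman correlation works on ranks, which suits an ordinal score such as quality and is less sensitive to the outliers and skew noted in the univariate section.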
The correlation matrix shows which input variables are correlated to another input variable. It is really surprising for me to find that sometimes, a pair of input variables may be correlated in red wines and not in white wines and vice versa. I plotted the ones that are correlated in both wines, except for the citric acid vs. fixed acidity, above.
The strongest relationship between the output variable (quality) and an input variable is with alcohol content, in both wines.
Among input variables, the strongest relationships are between density and alcohol, free and total sulfur dioxide, and density and residual sugar. The following pairs are correlated to a lesser extent than the pairs just mentioned:
Scatter plots of alcohol vs. other input variables in red and white wines
## Warning: Removed 6 rows containing missing values (geom_point).
## Warning: Removed 52 rows containing missing values (geom_point).
Classification of all wine by quality:
Analyzing all wines using alcohol, chlorides and volatile acidity:
(Using density did not create a better plot.)
Classification of red wines by quality:
From the correlation matrix above, red wine quality is influenced more by volatile.acidity, alcohol and sulphates.
## Warning: Removed 112 rows containing missing values (geom_point).
## Warning: Removed 63 rows containing missing values (geom_point).
Red wines are separated more clearly than white wines by these variables.
Combining all three variables to classify the quality of red wines only:
Since volatile acidity is related to pH (in theory):
Classification of white wines by quality:
To see how white wine quality is influenced by chlorides, density and alcohol, I plot the following.
## Warning: Removed 337 rows containing missing values (geom_point).
## Warning: Removed 160 rows containing missing values (geom_point).
It seems like white wines are not as properly separated by these variables.
Since residual sugar is highly correlated with density in white wines:
## Warning: Removed 97 rows containing missing values (geom_point).
Another attempt using residual sugar/chlorides ratio:
The paper that the data came from reports that sulphates contributed the most to the classification (using support vector machines):
It looked like this has a better effect in separating the qualities of white wines.
Of all the input variables, alcohol content and volatile acidity and total sulfur dioxide probably best separate red wines from white wines, judging from the scatter plots above. pH and residual sugar definitely cannot determine whether a wine is red or white.
For red wines, using the ratio of the variables that had the highest correlation coefficients with quality increased the separation of the quality values. It was harder for white wines. Because density is correlated with residual sugar, swapping density for residual sugar helped in the classification of white wines. Other swaps can be tried in a similar fashion to see whether the quality values can be separated in a two-dimensional plot. A three-dimensional plot might give a better classification. Separation is easiest to see by looking only at the colors for quality values 5, 6 and 7. Perhaps because these were the most represented in the data, they were better classified in the plot.
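Separation by a ratio feature can also be inspected numerically by summarizing the ratio within each quality level. The sketch below uses the first ten red wine rows shown in the str() output earlier; the alcohol to volatile acidity ratio is only an assumed example, since the text does not state which ratio was plotted, and ten rows are far too few to draw conclusions from.

```r
# First ten red wine rows of three columns, copied from the str() output
# shown earlier in this report.
reds_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Hypothetical ratio feature (an assumption, chosen for illustration only).
reds_head$alc.to.va <- reds_head$alcohol / reds_head$volatile.acidity

# Median of the ratio within each quality level.
ratio_by_quality <- aggregate(alc.to.va ~ quality, data = reds_head, FUN = median)
print(ratio_by_quality)
```

If the per-level medians are well spread relative to the within-level variation, the feature separates the quality values; heavily overlapping levels suggest that the feature adds little.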
It is surprising to see that residual sugar didn’t give a high correlation with quality while it was able to help in contributing to the classification of the qualities of white wines. But it is interesting to be able to swap a feature with a feature it correlates with and classification seems to improve.
## Warning: Ignoring unknown parameters: binwidth, bins, pad
In the wines data set, the majority of the observations are white wines. Most of the wines fall under the quality values of 5, 6 and 7. There are only a few 9’s, and all of them are white wines.
Correlation matrix of all wines
Wine quality is correlated to volatile acidity, chlorides, density and alcohol (using the Spearman method). However, red wines have different correlations than white wines.
Features that can separate the quality values of wines are alcohol content, chlorides and volatile acidity. Features that distinguish red wines from white wines are alcohol and either total sulfur dioxide or volatile acidity. To classify red wines by quality, the features that contribute most are alcohol, volatile acidity and sulphates. To classify white wines by quality, the features that contribute most are alcohol, chlorides and either density or residual sugar.
I obtained the wines data set by combining two csv files and explored the 11 input variables with respect to the output variable (quality) and to the newly created variable “type”. From the initial exploration, I found that the input variables influence red wine quality differently from white wine quality. I therefore examined the two types of wine separately, although it is easy to determine which of the input variables distinguish white wines from red wines. Classification of wines can be approached by determining the correlation of the input variables with quality. The input variables with the highest correlation coefficients with quality were used in the multivariate plots to see the separation among the quality values of red and white wines. Red wines were easily classified, while white wines were a little more challenging. Machine learning techniques would probably be better suited to analyzing this type of data.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.
Archived: In R, how do I append two data files? https://kb.iu.edu/d/bcrr
Factor variables http://statistics.ats.ucla.edu/stat/r/modules/factor_variables.htm
Adding and removing columns from a data frame http://www.cookbook-r.com/Manipulating_data/Adding_and_removing_columns_from_a_data_frame/
Exploratory data analysis and data pre-processing https://onlinecourses.science.psu.edu/stat857/print/book/export/html/224
Practical Winery & Vineyard Journal (Jan/Feb 2009) http://www.practicalwinery.com/janfeb09/page2.htm
Exploratory Data Analysis on Wine Quality by Bilal Mahmood https://rpubs.com/Bilal_Mahmood/EDA
Wine Quality Analysis http://rstudio-pubs-static.s3.amazonaws.com/24803_abbae17a5e154b259f6f9225da6dade0.html
Correlation matrix http://www.cookbook-r.com/Graphs/Correlation_matrix/
An introduction to corrplot package https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
Diamonds exploration by Chris Saden https://s3.amazonaws.com/udacity-hosted-downloads/ud651/diamondsExample.html